Towards Automatic Content-based Organization of Multilingual Digital Libraries: an English, French, and German View of the Russian Information Agency Novosti News

Authors

  • Andreas Rauber
  • Michael Dittenbach
  • Dieter Merkl
Abstract

In this paper we present the application of the SOMLib digital library system to a multilingual document corpus from the Russian Information Agency Novosti. News articles in Russian, English, and German are automatically organized into separate topic hierarchies using a novel unsupervised neural network, namely the Growing Hierarchical Self-Organizing Map. Furthermore, machine translation is used to create a coherent corpus in a single target language. In spite of the “noise” introduced by the automatic translation, the neural network can still create a consistent topical structuring of the integrated document collection. This facilitates straightforward browsing and exploration of multilingual document collections in a given target language.
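As a rough illustration of the map-based organization the abstract refers to, the following is a minimal sketch of a plain, fixed-size self-organizing map trained on term-frequency vectors. It is not the SOMLib implementation and not the Growing Hierarchical Self-Organizing Map itself, which additionally grows its maps and adds child maps for heterogeneous units. The function names (`bag_of_words`, `best_matching_unit`, `train_som`), the toy documents, and all parameter values are assumptions made for illustration only.

```python
import math
import random


def bag_of_words(docs):
    """Turn tokenized documents into term-frequency vectors over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    return vocab, [[float(doc.count(term)) for term in vocab] for doc in docs]


def best_matching_unit(weights, vector):
    """Return the grid position whose weight vector is closest to `vector`."""
    return min(
        weights,
        key=lambda unit: sum((w - x) ** 2 for w, x in zip(weights[unit], vector)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a flat rows x cols SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same map
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                  # learning rate shrinks linearly
        sigma = max(sigma0 * decay, 0.5)  # neighborhood radius shrinks, with a floor
        for vector in vectors:
            bmu = best_matching_unit(weights, vector)
            for (r, c), w in weights.items():
                grid_dist2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
                # Pull each unit toward the input, weighted by grid distance to the BMU.
                for i in range(dim):
                    w[i] += lr * influence * (vector[i] - w[i])
    return weights


# Hypothetical toy corpus: two identical "politics" documents and one "space" document.
docs = [
    ["election", "parliament", "vote"],
    ["election", "parliament", "vote"],
    ["space", "rocket", "launch"],
]
vocab, vectors = bag_of_words(docs)
som = train_som(vectors, rows=2, cols=2)
placement = [best_matching_unit(som, v) for v in vectors]
print(placement)
```

Each document is assigned to the map unit whose weight vector it matches best, so documents with identical term vectors always share a unit. A hierarchical variant would train a further map on the documents assigned to any unit that represents them poorly.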

Similar resources

Working with Russian Queries for the GIRT, Bilingual and Multilingual CLEF Tasks

For our activities within the CLEF 2001 evaluation, Berkeley group one participated in the bilingual, multilingual and GIRT tasks focussing on the use of Russian queries. Performance on the Russian queries → English documents bilingual task was excellent, comparable to performance using German queries. For the multilingual task we utilized English as a pivot language between Russian and German a...

VOXALEAD: A Scalable Video Search Engine Based On Content

Most news organizations provide immediate access to topical news broadcasts through RSS streams or podcasts. Until recently, applications have not permitted a user to perform content based search within a longer spoken broadcast to find the segment that might interest them. Recent progress in both automatic speech recognition (ASR) and natural language processing (NLP) has produced robust tools...

Euronews: a multilingual speech corpus for ASR

In this paper we present a multilingual speech corpus, designed for Automatic Speech Recognition (ASR) purposes. Data come from the portal Euronews and were acquired both from the Web and from TV. The corpus includes data in 10 languages (Arabic, English, French, German, Italian, Polish, Portuguese, Russian, Spanish and Turkish) and was designed both to train AMs and to evaluate ASR performance...

Columbia Newsblaster: Multilingual News Summarization on the Web

We propose to show the new multilingual version of the Columbia Newsblaster news summarization system. The system addresses the problem of user access to browsing news in multiple languages from multiple sites on the internet. The system automatically collects, organizes, and summarizes news in multiple source languages, allowing the user to browse news topics with English summaries, and compar...

Added-Value of Automatic Multilingual Text Analysis for Epidemic Surveillance

The early detection of disease outbursts is an important objective of epidemic surveillance. Web news is one of the information sources for detecting epidemic events as soon as possible, but analyzing the tens of thousands of articles published daily is costly. Recently, automatic systems have been devoted to epidemiological surveillance. The main issue for these systems is to process more languag...

Publication date: 2010